Domain-specific transfer learning in the automated scoring of tumor-stroma ratio from histopathological images of colorectal cancer

Tumor-stroma ratio (TSR) is a prognostic factor for many types of solid tumors. In this study, we propose a method for automated estimation of TSR from histopathological images of colorectal cancer. The method is based on convolutional neural networks which were trained to classify colorectal cancer tissue in hematoxylin-eosin stained samples into three classes: stroma, tumor and other. The models were trained using a data set that consists of 1343 whole slide images. Three different training setups were applied with a transfer learning approach using domain-specific data i.e. an external colorectal cancer histopathological data set. The three most accurate models were chosen as a classifier, TSR values were predicted and the results were compared to a visual TSR estimation made by a pathologist. The results suggest that classification accuracy does not improve when domain-specific data are used in the pre-training of the convolutional neural network models in the task at hand. Classification accuracy for stroma, tumor and other reached 96.1% on an independent test set. Among the three classes the best model gained the highest accuracy (99.3%) for class tumor. When TSR was predicted with the best model, the correlation between the predicted values and values estimated by an experienced pathologist was 0.57. Further research is needed to study associations between computationally predicted TSR values and other clinicopathological factors of colorectal cancer and the overall survival of the patients.


Introduction
Deep learning (DL) has been the state-of-the-art medical image analysis technology for the last decade. It has been applied to various tasks also in digital pathology, e.g., tissue classification between normal and tumor tissues, defining tumor subtype, recognition (e.g. dividing cells) and segmentation (patch-or pixel-level segmentation) [1]. Tasks other than purely morphological have also been carried out, such as training DL models to predict certain genetic changes from hematoxylin-eosin (H&E)-stained histopathological sections without e.g. immunohistochemical staining [2][3][4][5][6]. These are relevant and meaningful efforts because the aforementioned tasks are time-consuming and expensive when carried out using manual laboratory methods [7].
A common challenge for developing artificial intelligence methods lies in the insatiable data hunger of DL algorithms. The lack of annotated data is also one of the most significant challenges for digital pathology [8]. Gaining a sufficient amount of high-quality training and testing data means hours of work for pathologists to annotate regions of interest to digitized images. One way to alleviate the data scarcity problem could be, e.g., transfer learning.
Transfer learning means utilizing a pre-trained neural network that has already learned a machine learning task in some domain that is not necessarily the same as in the target application. Transfer learning can be accomplished, e.g. using pre-trained ImageNet [9] neural network architecture that will provide the initial parameter values for the model. Although ImageNet model has been trained with images representing dogs, planes and houses, that are essentially very dissimilar to histopathological images, initializing weights of the convolutional neural network (CNN) model with ImageNet has been shown to increase prediction accuracy in medical imaging tasks [1,10]. Sometimes a pre-trained model can also be available from the target domain. Some studies have shown that such a domain-specific pre-trained model may further improve prediction performance in the context of histopathological image analysis when compared to the ImageNet-initialization [10,11]. In the medical domain, the deeper models have been shown to perform better as a feature extractor compared to shallow and linear models [12]. On the other hand, also lightweight models have performed well when transfer learning is applied, as seen in a study by Zhang et al. [13]. ImageNet-based transfer learning is a common approach in digital pathology in general [9,14,15], and it is also the most common transfer learning approach in DL models trained with colorectal cancer (CRC) histopathological images [16].
This study focuses on automating the estimation of TSR from histopathological images of CRC using transfer learning. CRC is the second most death-causing cancer in the world with over 900,000 deaths every year [17]. One prognostic factor for CRC is the proportion of stroma within the tumor site. It has been shown to associate with the survival of the patient in many solid cancer types. The low amount of stroma (TSR � 50%) associates with a better prognosis [18][19][20][21].
Pathologists determine TSR visually by following a certain scoring protocol [22]. The main problem of this approach is the reproducibility of the TSR scoring. Overview by Van Pelt et al. [22] showed that the Cohen's kappa scores measuring the inter-observer agreement of visual TSR using binary scoring (TSR > 50% = stroma-high and TSR � 50% = stroma-low) ranged from 0.60 to 0.89. Automated TSR estimation may improve reproducibility, but it is a challenging task and a new relatively new concept.
Automated estimation of TSR begins by tiling a histopathological whole-slide image (WSI) into smaller image patches. After that TSR can be predicted by classifying the patches and calculating the proportion of tumor and stroma. Another approach is to use a particular spot that is a smaller part of the WSI selected by a pathologist to calculate TSR. Both approaches have been applied in previous studies [23][24][25][26][27][28].
Sirinukuvattana et al. [23,24] automated the TSR estimation using a CNN model trained for nuclei detection. They trained a model based on VGG19 [29] architecture to classify nine tissue types and the accuracies for detecting stroma and tumor were 90.4% and 96.0%, respectively. In contrast to other TSR-related studies, their results did not show prognostic value for TSR [25][26][27].
Zhao et al. [26] proposed a nine-class CNN model using transfer learning for which overall classification accuracies on two test sets were 95.7% and 97.5%. Classification accuracies for tumor and stroma were 92.8% and 70.9% on the test set 1 that was published by Kather et al. [30]. The test set 2 was a random sample from images collected at Yunnan Cancer Hospital. Classification accuracies for tumor and stroma on the test set 2 were 97.2% and 89.1%. Zhao et al. used pathologists annotations as ground truth for TSR estimation on 126 image blocks of size 1 μm 2 , the predicted tumor and stroma areas were in high agreement with pathologists' annotations (Pearson r = 0.939, 95% CI 0.914-0.957). When splitting the patient cohort into two categories stroma-low (TSR < 48.8%) and stroma-high (TSR�48.8%) based on the TSR results of their model, TSR was shown to be an independent prognostic factor in the overall survival of CRC patient.
Instead of using the whole slide as an input for a neural network model, an alternative approach is to calculate TSR from the same circular spot where the visual TSR estimation takes place. The downside of this approach is the manual effort needed to point the spot for the machine learning model. Geessink et al. [28] developed an 11-layer VGG neural network on 129 patients to classify nine tissue types. Using 50% cutoff-value between stroma-high and stroma-low categories there was a considerable disagreement between the model and pathologist (Cohen's kappa κ = 0.239). Cohen's kappa score was slightly improved (κ = 0.521) using the median of the TSR values estimated with the model as a cutoff value for stroma-high and stroma-low. Also, stroma-high-and stroma-low grouping showed a strong prognostic value when the median was used as a cutoff value.
In the present study, 12 DL models for estimating TSR for CRC samples were developed and tested using three different transfer learning setups, four different pre-trained CNN architectures, and two distinct CRC data sets. Models were trained to classify image patches into three different tissue classes: tumor, stroma and other. Three best-performing CNN models were chosen to estimate TSR from WSIs in a separate TSR test set. The predicted TSR values were then compared with the pathologist's TSR estimates.

Data
In this study, two mutually independent CRC data sets were used: a CRC data set from Central Finland Health Care District ("CFHCD-data") and a public CRC data set ("NCT-CRC-HE-100K") [30].
CFHCD-data consists of 1343 patients of primary colorectal cancers (stages I-IV) with one WSI from each patient. The cohort is described in detail by Elomaa et al. [31]. The WSIs were scanned with Hamamatsu NanoZoomer-XR with resolution of 0.5 microns per pixel (MPP).
Automated estimation of the TSR-value for a single WSI is based on calculating the ratio of patches representing stroma and tumor classes. The estimation process consists of two steps: 1) CNN-classification of patches for a WSI 2) Calculation of the ratio of tumor and stroma patches.
For developing and testing the models, CFHCD-data were split into two disjoint sets at random. The first one (Dev-set) consisted of 169 WSIs and was used in developing CNN classification models and the second one (TSR-test set) consisted of 1174 WSIs and it was applied to test CNN-based estimation of the TSR-value. The overall study flow is presented in Fig 1. Before tiling and preprocessing, Dev-set was annotated by an experienced pathologist into three different tissue categories, tumor, stroma and other. The annotations were accomplished with QuPath image analysis tool [32]. The class other includes debris, lymphocytes, mucus, normal epithelium and smooth muscle (See Fig 2). Ground truth TSR-values (from now on referred to as true TSR-values) were visually assessed by the pathologist for the TSR-test set following the protocol by Van Pelt et al. [22].
A subsample (n = 31,337) of NCT-CRC-HE-100K data set were used to investigate the effects of domain-specific pre-training of the CNN models. The original data includes 100,000 patches tiled from 86 WSIs (0.5 MPP). The patches are pre-labeled to nine tissue classes: adipose, debris, lymphocytes, mucus, normal, smooth muscle, stroma and tumor. In this study, only patches representing stroma (10,446 patches) and tumor (10,446 patches) classes were used as such, whereas a random sample of 10,445 patches from debris, lymphocytes, mucus, normal and smooth muscle classes with even distribution was drawn and assigned to a class called other.

Tiling and preprocessing of WSIs
Following the annotation of the WSIs in Dev-set they were tiled into smaller WSI patches (224 × 224 pixels / 101 μm × 101 μm). Tiling was performed with a sliding window procedure with 64 pixels overlapping. If less than 75% of the patch area included annotated pixels, the patch was discarded from further analyses. All patches were color normalized by Macenko's method [33].
Since the classifiers were not trained to detect the image background and adipose tissue, the WSIs in the TSR-test set were tiled in the following way: First, to remove image background and adipose tissue, a binary mask was applied using Otsu's algorithm for choosing the optimal threshold value [34]. After this the image patches were tiled using a sliding window procedure with no overlapping. If the masked area was less than 75%, the patch was discarded from further analyses. Macenko's method was applied for color normalization of the patches [33].

Training and evaluation of classifiers
For training and selecting the best CNN models Dev-set was split at random into two distinct WSI sets of CNN-train (n = 152) and CNN-test (n = 17) with the constraints that after the tiling process the maximum difference of eighty patches in the training distribution of the target classes was allowed and the minimum number of patches was nine hundred in each test class. The constraints were applied to ensure approximately balanced class distribution in CNN-train and that the size of CNN-test is at least 10% of Dev-set.
This resulted in 24,329 and 2,770 input patches for training and testing of the classifiers, respectively. The final numbers of patches in the CNN-train/CNN-test classes were 8139/900, 8060/900, and 8130/900 for tumor, stroma and other, respectively. Examples of image patches from each class are shown in Fig 3. For predicting TSR in WSIs, four convolutional neural network (CNN) architectures, Alex-Net [35], GoogleNet [36], ResNet50 [37], VGG19 [29], were trained on CNN-train to classify the patches into three tissue types: tumor, stroma and other. For all the CNN models, the input layer was set according to the WSI patch dimensions 224 × 224, and the output layer was softmax function (dimension = 3).
Three different pre-training strategies for CNN models were applied (SETUP-1, SETUP-2 and SETUP-3). In SETUP-1 the CNN model was first initialized with ImageNet [9] weights and thereafter pre-trained with the domain-specific subsample of NCT-CRC-HE-100K-data before finalizing the training with CNN-train. In SETUP-2 training on CNN-train started directly from ImageNet weights. In SETUP-3 the CNNs were first initialized with random weights using PyTorch default parameters and then subsequently pre-trained with the subsample of NCT-CRC-HE-100K-data and finalized with CNN-train.
For both pretraining the CNN models on the NCT-CRC-HE-100K patches as well as finetuning the models on CFHCD-data (CNN-train and CNN-test), the most suitable values for hyperparameters were selected using 5-fold cross-validation (C-V) CNN-train. In 5-fold C-V data are split at random into the five subsets. Then each subset acts once as a test fold (set) while the four other are used for training. The error estimate for a model is the average error over the test folds.
After finding the best hyperparameter values for the neural networks, the optimal number of epochs was determined with an early stopping model selection strategy on CNNtrain using approximately 1/3 of the training patches for validation. The final model was then trained once more on the full CNN-train until the optimal number of epochs was reached. The generalization performance of each CNN model was then assessed by computing the classification accuracy, precision, recall and F 1 -score of the models on CNN-test (2,770 CFHCD-patches).
For more information about chosen hyperparameters, see S1 Table. Calculating tumor-stroma ratio TSR was calculated for each WSI in TSR-test as the ratio of patches classified by a CNN model as stroma and tumor: where n stroma and n tumor are the number of stroma and tumor patches, respectively.

Equipment and software
All the neural network models were trained on Linux GPU server Tesla P100, x 86_64 with Python-version 3.

Validation and test results of the final models
Classification accuracy metrics on the CNN-test set are shown for all the final CNN models in Table 1. The three highest values were obtained by training the models directly from Imagenet-pretrained weights. The differences between the three best CNN architectures are negligible. Only Alexnet showed slightly poorer performance in terms of classification accuracy. To support the interpretation of the results also validation accuracies are reported in Table 2. A more detailed comparison between true and predicted classifications on CNN-test can be seen for the most accurate top-3 models in Fig 4. All the models performed well in detecting the tumor class but had slight difficulties distinguishing between classes stroma and other.
Precision, recall and F 1 -score are shown in Table 3. The numbers show that all the top-3 models attained excellent hit rate (recall > 0.9) for all classes as well as they are accurate in predicting positive cases (precision > 0.9). Consequently all the models attained high F1-score (> 0.9). The observed differences between the top-3 models are so small that they can most likely be explained by random variation.

Tumor-stroma ratio predictions
Distributions of the predicted TSR values on the TSR-test set are shown for the top-3 models in Fig 5. Examples of the most accurate and least accurate TSR predictions on WSIs shown in S1 Fig.   Mean, median, standard error of the estimate (SEE), standard deviation (std) of predicted TSR values and the number of WSIs in each category are shown in Table 4. The results show that all the models overestimate the TSR value in the three lowest target categories (10%, 20% and 30%) and underestimate in the remaining target categories. All the models perform best in the category 60% (the mean differences between the predicted and true TSR values 1.4) whereas the poorest performance is observed in the categories 10% and 80% (the mean differences between the predicted and true TSR values 14.4 and 14.2 respectively). The smallest  SEEs are observed in categories 30%, 40% and 90%. When grouping the TSR values into stroma-high (TSR > 50%) and stroma-low (TSR � 50%), the Cohen's kappa scores between the true and predicted TSR values were 0.32 (SETUP 2 / Googlenet), 0.33 (SETUP 2 / ResNet50) and 0.33 (SETUP 2 / VGG19).

Discussion
The aim of this study was to investigate how accurately CNN-based machine learning models can predict the ratio of tumor and stroma tissue in WSI samples. For the best model (ResNet50 architecture) the correlation between the true and predicted TSR values was 0.57 (SEE = 18.6). Approximately the same performance was obtained with GoogleNet and VGG19 architectures. Utility of domain-specific pre-training of CNN models was also investigated, but no meaningful differences were observed. The results show that the same or even better performance with comparable computational cost can be achieved in the present task without domain-specific pre-training of CNN. Even though the automated TSR were predicted from the whole tissue area in contrast to the pathologist who selected a small spot for estimation, their outcomes correlates (r = 0.57) rather well. Cohen's kappa score (κ = 0.33) for stroma-high and stroma-low classification was comparable with results from the previous study by Geessink et al. [28] where Cohen's kappa score for stroma-high and stroma-low with 50% cutoff-value was 0.24. An overview by [22] show that the inter-observer kappa-values variate between 0.60 to 0.89 when pathologists estimate the TSR of CRC binary into stroma-high and stroma-low.
In this study SEE was lowest in TSR categories 30%, 40% and 90%. When comparing the means of the predicted TSR values with the true TSR values, the best performing set of images was the one with TSR 60% and the second largest number of samples. Balancing the data for TSR prediction task might bring more consistency to the results.
The results also indicate that the most difficult part of the classification task is to separate stroma and other. This can be due to the visual similarity of smooth muscle and fibrotic stroma. This may cause the classifier to make a mistake since the smooth muscle tissue belongs to the other class. The smooth muscle and stroma are difficult to separate even by the human eye. Despite the minor weaknesses, the classification accuracy of the best CNN models was comparable to previous studies [25,26]. The classification accuracy for the tumor tissue was over 98% with all top-3 models.
Despite the promising performance of machine learning bringing some aspects of the visual estimation procedure could improve the automated models. For example, only tumor-related stroma tiles could be taken into account. Another option could be mimicking the visual human process by going through the image frame by frame and choosing one spot for making the final TSR prediction. In addition to these, using a smaller tile size might increase classifying accuracy since some tumor areas, as well as stromal areas in between, seem to be quite narrow. This could have a significant effect on the quality of TSR predictions.
Even though accuracy of the proposed automated method for TSR estimation does not fully compare to human visual analysis, the reproducibility of computational model outcomes is a major advantage. Automated machine learning based tools would bring reproducibility to daily practices and the TSR estimation process, in particular. Moreover, TSR estimation with an automated machine learning model can be completed in a fraction of the time compared to the visual method.
This study has some limitations. Firstly, the aim of this research was to develop models on the Finnish population and, therefore, their generalizability to other populations can not be guaranteed without further research. When interpreting the results, it is important to consider the size of test data. All the test results are, however, produced by trying the models only once on the independent random test set which guarantees that the models were not overfit to the test samples. In order to take the models into clinical use more extensive external tests are needed.
As visually estimated TSR has been shown to be an independent prognostic factor in solid cancer types [18][19][20][21], further studies should take place for assessing the correlation of the machine learning predicted TSR values with other clinicopathological factors and the overall survival of patients. If a correlation is found, the model should be optimized to perform better in terms of generalizability. Also a "hotspot"-analysis of TSR, in which a spot where to estimate the TSR would be manually chosen, should be considered as it would ease the computational burden of DL based method and thereby improve the prediction accuracy.
Supporting information S1 Table. Hyperparameters. Chosen parameters in the final training phase of all models. LR refers to the learning-rate and optimizer is the optimization-function used. Parameters were chosen using 5-fold cross-validation. Loss-function was cross-entropy for all setups.